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ABSTRACT 

This  paper  describes  a  system  for  rapidly  retargetable  interactive 
translingual  retrieval.  Basic  functionality  can  be  achieved  for  a  new 
document  language  in  a  single  day,  and  further  improvements  re¬ 
quire  only  a  relatively  modest  additional  investment.  We  applied 
the  techniques  first  to  search  Chinese  collections  using  English  queries, 
and  have  successfully  added  French,  German,  and  Italian  document 
collections.  We  achieve  this  capability  through  separation  of  language- 
dependent  and  language-independent  components  and  through  the 
application  of  asymmetric  techniques  that  leverage  an  extensive  En¬ 
glish  retrieval  infrastructure. 
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1.  INTRODUCTION 

Our  goal  is  to  produce  systems  that  allow  interactive  users  to  present 
English  queries  and  retrieve  documents  in  languages  that  they  can¬ 
not  read.  In  this  paper  we  focus  on  what  we  call  “rapid  retargetabil- 
ity”:  extending  interactive  translingual  retrieval  functionality  for  a 
new  document  language  rapidly  with  few  language-specificresources. 
Our  current  system  can  be  retargeted  to  a  new  language  in  one  day 
with  only  one  language-dependent  resource:  a  bilingual  term  list.  ’ 
Our  language-independent  architecture  consists  of  two  main  com¬ 
ponents: 

1 .  Document  translation  and  indexing 

2.  Interactive  retrieval 

We  describe  each  of  these  components,  demonstrate  their  effective¬ 
ness  for  information  retrieval  tasks,  and  then  conclude  by  describing 
our  experience  with  adding  French,  German  and  Italian  document 
collections  to  a  system  that  was  originally  developed  for  Chinese. 

^  For  Asian  languages  we  also  use  a  language-specific  segmentation 
system. 


2.  DOCUMENT  TRANSUATION  AND  IN¬ 
DEXING 

We  have  adopted  a  document  translation  architecture  for  two  rea¬ 
sons.  First,  we  support  a  single  query  language  (English)  but  multi¬ 
ple  document  languages,  so  indexing  English  terms  simplifies  query 
processing  (where  interactive  response  time  can  be  a  concern).  Sec¬ 
ond,  a  document  translation  architecture  simplifies  the  display  of 
translated  documents  by  decoupling  the  translation  and  display  pro¬ 
cesses.  Gigabyte  collections  require  machine  translation  that  is  or¬ 
ders  of  magnitude  faster  than  present  commercial  systems.  We  ac¬ 
complish  this  using  term-by-term  translation,  in  which  the  basic  data 
structure  is  a  simple  hash  table  lookup.  Any  translation  requires 
some  source  of  translation  knowledge — we  use  a  bilingual  term  list 
containing  English  translation(s)  for  each  foreign  language  term.  We 
typically  construct  these  term  lists  by  harvesting  Internet-available 
translation  resources,  so  the  foreign  language  terms  for  which  trans¬ 
lations  are  known  are  typically  an  eclectic  mix  of  root  and  inflected 
forms.  We  accommodate  this  limitation  using  a  four-stage  backoff 
statistical  stemming  approach  to  enhance  translation  coverage. 

2.1  Preprocessing. 

Differences  in  use  of  diacritic-s,  case,  and  punctuation  can  inhibit 
matching  between  term  list  entries  and  document  terms,  so  normal¬ 
ization  is  important.  In  order  to  maximize  the  probability  of  match¬ 
ing  document  words  with  term  list  entries,  we  normalize  the  bilin¬ 
gual  term  list  and  the  documents  by: 

•  converting  characters  in  Western  languages  to  lowercase, 

•  removing  all  accents  and  diacritics,  and 

•  segmentation,  which  for  Western  languages  merely  involves 
separating  punctuation  from  other  text  by  the  addition  of  white 
space. 

Our  preprocessing  also  includes  conversion  of  the  bilingual  term  list 
and  the  document  collection  into  standard  formats.  The  preprocess¬ 
ing  typically  requires  about  half  a  day  of  programmer  time. 

2.2  F our-Stage  Backoff  Translation. 

Bilingual  term  lists  found  on  the  Web  often  contain  an  eclectic 
mix  of  root  forms  and  morphological  variants.  We  thus  developed 
a  four-stage  backoff  strategy  to  maximize  coverage  while  limiting 
spurious  translations: 

1 .  Match  the  surface  form  of  a  document  term  to  surface  forms 
of  source  language  terms  in  the  bilingual  term  list. 


2.  Match  the  stem  of  a  document  term  to  surface  forms  of  source 
language  terms  in  the  bilingual  term  list. 
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3 .  Match  the  surface  form  of  a  document  term  to  stems  of  source 
language  terms  in  the  bilingual  term  list. 

4.  Match  the  stem  of  a  document  term  to  stems  of  source  lan¬ 
guage  terms  in  the  bilingual  term  list. 

The  process  terminates  as  soon  as  a  match  is  found  at  any  stage,  and 
the  known  translations  for  that  match  are  generated.  Although  this 
may  produce  an  inappropriate  morphological  variant  for  a  correct 
English  translation,  use  of  English  stemming  at  indexing  time  mini¬ 
mizes  the  effect  of  that  factor  on  retrieval  effectiveness.  Because  we 
are  ultimately  interested  in  processing  documents  in  any  language, 
we  may  not  have  a  hand-crafted  stemmer  available  for  the  document 
language.  We  have  thus  explored  the  application  of  rule  induction 
to  learn  stemming  rules  in  an  unsupervised  fashion  from  the  collec¬ 
tion  that  is  being  indexed  [2]. 

2.3  Balanced  Top-2  Translation. 

We  produce  exactly  two  English  terms  for  each  foreign-language 
term.  Eor  terms  with  no  known  translation,  the  untranslated  term  is 
generated  twice  (often  appropriate  for  proper  names  in  the  Latin- 
1  character  set).  For  terms  with  one  translation,  that  translation  is 
generated  twice.  For  terms  with  two  or  more  known  translations, 
we  generate  the  “best”  two  translations.  In  prior  experiments  we 
have  found  that  this  balanced  translation  strategy  significantly  out¬ 
performs  the  usual  (unbalanced)  technique  of  including  all  known 
translations  [1].  We  establish  the  “best”  translations  by  sorting  the 
bilingual  term  list  in  advance  using  only  English  resources.  All  single¬ 
word  translations  are  ordered  by  decreasing  unigram  frequency  in 
the  Brown  corpus,  followed  by  all  multi-word  translations,  and  fi¬ 
nally  by  any  single  word  entries  not  found  in  the  Brown  corpus. 
This  ordering  has  the  effect  of  minimizing  the  effect  of  infrequent 
words  in  non-standard  usages  or  of  misspellings  that  sometimes  ap¬ 
pear  in  bilingual  term  lists.  This  translation  strategy  allows  balanc¬ 
ing  of  translations  in  a  modular  fashion,  even  when  one  does  not 
have  access  to  the  internal  parameters  of  the  information  retrieval 
system.  We  translate  ~  100  MB  per  hour  using  Perl  on  a  SPARC 
Ultra  5. 

2.4  Post-translation  Document  Expansion. 

We  implement  post-translation  document  expansion  for  the  for¬ 
eign  language  stories  after  translation  into  English  in  order  to  en¬ 
rich  the  indexing  vocabulary  beyond  that  which  was  available  af¬ 
ter  term-by-term  translation.  This  is  analogous  to  the  process  that 
Singhal  et  al.  applied  to  monolingual  speech  retrieval  [4]. 

Term-by-term  translation  produces  a  set  of  English  terms  that  serve 
as  a  noisy  representation  of  the  original  source  language  document. 
These  terms  are  then  treated  as  a  query  to  a  comparable  English  col¬ 
lection,  typically  contemporaneous  newswire  text,  from  which  we 
retrieve  the  five  highest  ranked  documents.  From  those  five  docu¬ 
ments,  we  extract  the  most  selective  terms  and  use  them  to  enrich 
the  original  translations  of  the  documents.  For  this  expansion  pro¬ 
cess  we  select  one  instance  of  every  term  with  an  IDF  value  above 
an  ad  hoc  threshold  that  was  tuned  to  yield  approximately  50  new 
terms.  This  optional  step  is  the  slowest  processing  stage,  with  a 
throughput  of  about  20  MB  per  hour. 

2.5  Indexing 

The  resulting  collection  is  then  indexed  using  Inquery  (version 
3.1pl),  with  the  kstem  stemmer  and  default  English  stopword  list. 
Indexing  is  the  fastest  stage  in  the  process,  with  throughput  exceed¬ 
ing  one  gigabyte  per  hour. 

3.  INTERACTIVE  RETRIEVAL 


Interactive  searches  are  performed  using  a  Web  interface.  Sum¬ 
mary  information  for  the  top-ranked  documents  is  displayed  in  groups 
of  ten  per  page.  Document  summaries  consist  of  the  date  and  a  gloss 
translation  of  the  document  title.  Users  can  inspect  a  gloss  transla¬ 
tion  of  the  full  text  of  any  document  if  the  title  is  not  sufficiently 
informative.  For  both  title  and  full  text,  the  gloss  translations  are 
generated  in  advance  using  the  same  process  as  translation  for  in¬ 
dexing,  with  the  following  differences  in  detail: 

•  Terms  added  as  a  result  of  document  expansion  are  not  dis¬ 
played. 

•  The  number  of  retained  translations  is  separately  selectable 
for  the  title  and  for  full  text  indexing. 

•  Translations  are  not  duplicated  when  fewer  than  the  maximum 
allowable  number  of  translations  are  known. 

Our  goal  is  to  support  the  process  ot finding  documents,  with  the 
realization  that  the  process  of  using  documents  may  need  to  be  sup¬ 
ported  in  some  other  way  (e.g.,  by  forwarding  relevant  documents 
to  someone  who  is  able  to  read  that  language).  We  have  therefore 
designed  our  interface  to  highlight  the  query  terms  in  translated  doc¬ 
uments  and  to  facilitate  skimming  by  emphasizing  the  most  com¬ 
mon  translation  when  multiple  translations  are  displayed.  We  have 
found  that  such  displays  can  support  a  classification  task,  even  when 
the  translation  is  not  easy  to  read  [3].  Documents  must  be  classified 
by  the  user  as  relevant  or  not  relevant,  so  our  classification  results 
suggest  that  this  can  be  an  effective  user  interface  design. 

4.  RESULTS 

We  present  results  both  for  component-level  performance  of  our 
language-independent  retargeting  modules  and  an  assessmentof  the 
overall  retargeting  process. 

4.1  Component-level  Evaluation 

We  applied  our  retargeting  approach  and  retrieval  enhancement 
techniques  described  above  in  the  context  of  the  first  Cross-Language 
Evaluation  Forum’s  (CLEF)  multilingual  task.  We  used  the  English 
language  forms  of  the  queries  to  retrieve  English,  French,  German, 
and  Italian  documents.  Below  we  present  comparative  performance 
measures  for  two  of  the  main  processing  components  described  above 
-  statistical  stemming  backoff  translation  -  applied  to  the  English- 
French  cross-language  segment  of  the  CLEF  task.  The  post-translation 
document  expansion  component  was  applied  to  the  smaller  Topic 
Detection  and  Tracking  (TDT-3)  collection  to  improve  retrieval  of 
Mandarin  documents  using  English. 

4.1.1  Baseline  CLEF  System  Configuration 

Our  baseline  run  was  conducted  as  follows.  We  translated  the 
~  44,  000  documents  from  the  1994  issues  of  Le  Monde.  We  used 
the  English-French  bilingual  term  list  downloaded  from  the  Web  at 
http  :  / /www .  f  reedict .  com.  We  then  inverted  the  term  list 
to  form  a  35,000  term  French-English  translation  resource.  We  per¬ 
formed  the  necessary  document  and  term  list  normalization;  in  this 
case,  removing  accents  from  document  surface  forms  to  enable  match¬ 
ing  with  the  un-accentedterm  list  entries,  converting  case,  and  split¬ 
ting  clitic  contractions,  such  as  I  ’horlage,  on  punctuation.  We  trained 
the  statistical  stemming  rules  on  a  sample  of  the  bilingual  term  list 
and  document  collection  and  applied  these  rules  in  stemming  back¬ 
off.  Our  default  condition  was  run  with  top-2  balanced  translation 
using  the  Brown  corpus  as  a  source  of  target  language  unigram  fre¬ 
quency  information.  Translated  documents  were  then  indexed  with 


_ 

Stage  1 

Stage  2 

Stage  3 

Stage  4 

Match 

70% 

3% 

0.5% 

1% 

Table  1:  Percentage  of  document  terms  translated  at  each  stage 
of  4-stage  backoff  translation  with  statistical  stemming. 

the  InQuery  (version  3.1pl)  system,  using  the  kstem  stemmer  for 
English  stemming  and  InQuery ’s  default  English  stopword  list.  Long 
queries  were  formed  by  concatenating  the  title,  description,  and  nar¬ 
rative  fields  of  the  original  query  specification.  The  resulting  word 
sequence  was  enclosed  in  an  InQuery  #sum  operator,  indicating 
unweighted  sum. 

Our  figure  of  merit  for  the  evaluations  below  is  mean  (uninter¬ 
polated)  average  precision  computed  using  tree  xval  ^  across  the  34 
topics  in  the  CLEE  evaluation  for  which  relevant  French  documents 
are  known. 

4.1.2  Backoff  Translation  with  Statistical  Stemming 

We  first  contrast  the  above  baseline  system  with  the  effectiveness 

of  an  otherwise  identical  run  without  the  stemming  backoff  compo¬ 
nent.  Terms  in  the  documents  are  thus  only  translated  if  there  is  an 
exact  match  between  the  surface  form  in  the  document  and  a  surface 
form  in  the  bilingual  term  list.  We  find  that  mean  average  preci¬ 
sion  for  unstemmed  translation  is  0.19  as  compared  with  0.2919  for 
our  baseline  system  including  stemming  backoff  based  on  trained 
rules.  This  difference  is  significant  at  p  <  0.05,  by  paired  t-test, 
two-tailed.  The  per-query  effectiveness  is  illustrated  in  Figure  1. 
Backoff  translation  improves  translation  coverage  while  retaining 
relatively  high  precision  of  matching  in  contrast  to  unstemmed  ef¬ 
fectiveness. 

Backoff  translation  improves  cross-language  information  retrieval 
effectiveness  by  improving  translation  coverage  of  the  terms  in  the 
document  collection.  Using  the  statistical  stemmer,  by-token  cover¬ 
age  of  document  terms  increased  by  7coverage.  The  different  stages 
of  the  four-stage  backoff  process  contributed  as  illustrated  in  1.  The 
majority  of  terms  match  in  the  Stage  1  exact  match,  accounting  for 
70%  of  the  term  instances  in  the  documents.  The  remaining  stages 
each  account  for  between  0.5%  and  3%  of  the  document  terms,  while 
20%  of  document  term  instances  remain  untranslatable.  However, 
this  relatively  small  increase  in  coverage  results  in  the  highly  sig¬ 
nificant  improvement  in  retrieval  effectiveness  above. 

4.1.3  Top-2  Balanced  Translation 

Here  we  contrast  top-2  balanced  translation  with  top- 1  transla¬ 
tion.  We  retain  statistical  stemming  backoff  for  the  top-1  transla¬ 
tion.  We  replace  each  French  document  term  with  the  highest  ranked 
English  translation  by  target  language  unigram  frequency  in  the  Brown 
Corpus  as  detailed  above,  retaining  the  original  French  term  when 
no  translation  is  found  in  the  bilingual  term  list.  We  achieve  a  mean 
average  precision  of  0.2532  in  contrast  with  the  baseline  condition. 
This  difference  is  significant  at  p  <  0.01  by  paired  t-test,  two-tailed. 
We  can  effectively  incorporate  additional  translations  using  top-2 
balanced  translation  without  degrading  performance  by  introducing 
significant  additional  noise.  A  query-by-query  contrast  is  presented 
in  Figure  2. 

4.1.4  Document  Expansion 

We  evaluated  post-translation  document  expansion  using  the  Topic 
Detection  and  Tracking  (TDT-3)  collection.  For  this  evaluation,  we 
used  the  TDT-1999  topic  detection  task  evaluation  framework,  but 

^Available  at  ftp://ftp.cs.cornell.edu/pub/smart/. 


because  out  focus  in  this  paper  is  on  ranked  retrieval  effectiveness 
we  report  mean  uninterpolated  average  precision  rather  than  the  topic- 
weighted  detection  cost  measure  typically  reported  in  TDT  In  the 
topic  detection  task,  the  system  is  presented  with  one  or  more  exem¬ 
plar  stories  from  the  training  epoch — a  form  of  query-by-example — 
and  must  determine  whether  each  story  in  the  evaluation  epoch  ad¬ 
dresses  either  the  same  seminal  event  or  activity  or  some  directly 
related  event  or  activity.  This  is  generally  thought  to  be  a  some¬ 
what  narrower  formulation  than  the  more  widely  used  notion  of  top¬ 
ical  relevance,  but  it  seems  to  be  well  suited  to  query-by-example 
evaluations.  The  TDT-1999  tracking  task  was  multilingual,  search¬ 
ing  stories  in  both  English  and  Mandarin  Chinese,  and  multi-modal, 
involving  both  newswire  text  and  broadcast  news  audio.  We  fo¬ 
cus  on  the  cross-language  spoken  document  retrieval  component  of 
the  tracking  task,  using  English  exemplars  to  identify  on-topic  sto¬ 
ries  in  Mandarin  Chinese  broadcast  news  audio.  We  compare  top-1 
translation  of  the  Mandarin  Chinese  stories  with  and  without  post¬ 
translation  document  expansion.^  We  used  the  earlier  TDT-2  En¬ 
glish  newswire  text  collection  as  our  side  collection  for  expansion. 
We  perform  topic  tracking  on  60  topics  with  4  exemplars  each.  Here, 
we  report  the  mean  average  precision  on  the  55  topics  for  which 
there  are  on-topic  Mandarin  audio  stories.  The  mean  uninterpolated 
average  precision  for  retrieval  of  unexpanded  documents  is  0.36  while 
post-translation  document  expansion  raises  this  figure  to  0.4 1 .  This 
difference  is  significant  at  |)  <  0.01  by  paired  t-test,  two-tailed.  The 
contrast  is  illustrated  in  Figure  3.  Interestingly,  when  we  tried  this 
with  French,  we  noted  that  expansion  tended  to  select  terms  from 
the  few  foreign-language  documents  that  happened  to  be  present  in 
our  expansion  collection.  We  have  not  yet  explored  that  effect  in  de¬ 
tail,  but  this  observation  suggests  that  the  document  expansion  may 
be  sensitive  to  the  characteristics  of  the  expansion  collection  that  are 
not  immediately  apparent. 

4.2  The  Learning  Curve 

We  have  found  that  retargeting  can  be  accomplished  quite  quickly 
(a  day  without  document  expansion,  three  days  for  TREC-sized  col¬ 
lections  with  document  expansion),  but  only  if  the  required  infras¬ 
tructure  is  in  place.  Adapting  a  system  that  was  developed  initially 
for  Chinese  to  handle  French  documents  required  several  weeks, 
with  most  of  that  effort  invested  in  development  of  four-stage  back¬ 
off  translation  and  statistical  stemming.  Further  adapting  the  system 
to  handle  German  documents  revealed  the  importance  of  compound 
splitting,  a  problem  that  we  will  ultimately  need  to  address  by  incor¬ 
porating  a  more  general  segmentation  strategy  than  we  used  initially 
for  Chinese.  In  extending  the  system  to  Italian  we  have  found  that 
although  our  statistical  stemmer  presently  performs  poorly  in  that 
language,  we  can  achieve  quite  credible  results  even  with  a  fairly 
small  (17,313  term)  bilingual  term  list  using  afreely  available  Mus¬ 
cat  stemmer  (which  exist  for  ten  languages).  So  although  it  is  pos¬ 
sible  in  concept  to  retarget  to  a  new  language  in  just  a  few  days,  ex¬ 
tending  the  system  typically  takes  us  between  one  and  three  weeks 
because  we  are  still  climbing  the  learning  curve. 

5.  CONCLUSION 

By  building  on  the  lessons  learned  using  the  TREC,  CLEF,  NT- 
CIR,  and  TDT  collections,  we  have  sought  to  build  an  infrastructure 
that  can  be  applied  to  a  broad  array  of  languages.  Arabic  and  Ko¬ 
rean  collections  are  expected  to  become  available  in  the  next  year, 
and  we  are  now  evolving  our  interface  to  support  user  studies.  Our 
approach  is  distinguished  by  support  for  interactive  retrieval  even 

^  Since  Mandarin  Chinese  has  little  surface  morphology,  we  omit 
backoff  translation  in  this  case. 


Figure  1:  Comparison  of  effectiveness  of  backoff  versus  unstemmed  translation  of  French  documents:  Bars  above  x-axis  indicate 
backoff  transition  outperforms  unstemmed  translation. 
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Figure  2:  Comparison  of  effectiveness  of  top-2  balanced  versus  top-1  translation  of  French  documents:  Bars  above  x-axis  indicate 
“Top-2”  outperforms  “Top-1” 
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Figure  3:  Comparison  of  effectiveness  of  top-1  post-translation  document  expansion  versus  bare  top-1  translation  of  Chinese  docu¬ 
ments:  Bars  above  x-axis  indicate  document  expansion  outperforms  bare  translation 


in  languages  for  which  machine  translation  is  presently  unavailable, 
and  our  ultimate  goal  is  to  characterize  how  closely  we  can  approx¬ 
imate  the  retrieval  effectiveness  users  would  obtain  if  they  had  the 
best  available  machine  translations  for  the  retrieved  documents. 
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